Poster: aronsson Date: Jan 30, 2005 9:35pm
Forum: toronto Subject: Re: Universal OCR

Sorry, the formatting was lost there. What I wanted to say was this:

For example, the first line of raw OCR text of page 9 of the DjVu file UF00001842 reads:

(WORD coords="382,2455,466,2381")I(/WORD)
(WORD coords="511,2455,568,2380")is(/WORD)
(WORD coords="618,2455,660,2408")a(/WORD)
(WORD coords="705,2481,1077,2377")delightf-ul(/WORD)
(WORD coords="1132,2482,1418,2379")spring,(/WORD)
(WORD coords="1485,2456,1606,2382")the(/WORD)
(WORD coords="1652,2458,1848,2380")birds(/WORD)
(WORD coords="1901,2471,2171,2380")warble,(/WORD)

and the proofed plain text at Project Gutenberg reads:

It is a delightful spring: the birds warble,

so words such as "is", "birds", and "warble," match unchanged and could serve as anchor points. Is this enough for designing a utility that maps the coordinates of "delightf-ul" to the corrected word "delightful"? Would this be useful?
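
A minimal sketch of what such a utility might look like, using Python's standard difflib to align the OCR tokens with the proofed tokens. The function name align and the rule of only pairing words inside equal-length "replace" runs are assumptions made for illustration; a real tool would need to handle split or merged words, line breaks, and hyphenation across lines.

```python
from difflib import SequenceMatcher

# OCR words and their coords strings, copied from the example above.
ocr_words = [
    ("I", "382,2455,466,2381"),
    ("is", "511,2455,568,2380"),
    ("a", "618,2455,660,2408"),
    ("delightf-ul", "705,2481,1077,2377"),
    ("spring,", "1132,2482,1418,2379"),
    ("the", "1485,2456,1606,2382"),
    ("birds", "1652,2458,1848,2380"),
    ("warble,", "1901,2471,2171,2380"),
]
proofed = "It is a delightful spring: the birds warble,".split()


def align(ocr_words, proofed):
    """Pair each OCR box with a proofed word where the alignment is unambiguous.

    Returns a list of (coords, ocr_word, proofed_word) tuples.
    """
    ocr_tokens = [word for word, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_tokens, b=proofed, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # Unchanged words anchor the alignment; a replaced run of the
            # same length is assumed to correspond word for word.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                pairs.append((ocr_words[i][1], ocr_tokens[i], proofed[j]))
        # Inserted, deleted, or unevenly replaced runs are skipped: their
        # bounding boxes cannot be assigned without further heuristics.
    return pairs


for coords, ocr_word, fixed in align(ocr_words, proofed):
    print(coords, ocr_word, "->", fixed)
```

On this one line, the matching runs "is a" and "the birds warble," pin down the alignment, so the box of "delightf-ul" ends up attached to "delightful". Lines where the OCR split or merged words would fall into the skipped case.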

Poster: brewster Date: Jan 30, 2005 10:54pm
Forum: toronto Subject: Re: Universal OCR

If there were a tool to go from

djvuxml -> distributed proofreaders -> djvuxml
and preserve as many bounding boxes as possible (some of them will be difficult or impossible to keep, so it is not essential to retain them all),

then we would have a set of word images paired with Unicode words, which you can think of as a training set for OCR.
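
As a rough illustration of that last step, here is a sketch that reads WORD elements from a proofread XML fragment and emits (box, text) pairs. The SAMPLE fragment, the LINE wrapper, and the function name training_pairs are hypothetical, and the sketch assumes the four coords values are pixel positions of two opposite corners; it normalizes them rather than relying on a particular order.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of proofread XML, for illustration only.
SAMPLE = """<LINE>
<WORD coords="382,2455,466,2381">It</WORD>
<WORD coords="511,2455,568,2380">is</WORD>
<WORD coords="705,2481,1077,2377">delightful</WORD>
</LINE>"""


def training_pairs(xml_text):
    """Return a list of ((left, upper, right, lower), text) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    for word in root.iter("WORD"):
        coords = word.get("coords")
        text = (word.text or "").strip()
        if not coords or not text:
            # A word without a box or without text is useless for training.
            continue
        x1, y1, x2, y2 = (int(value) for value in coords.split(","))
        # Normalize so the box is always (left, upper, right, lower),
        # whichever corner order the file uses.
        box = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        pairs.append((box, text))
    return pairs


for box, text in training_pairs(SAMPLE):
    print(box, text)
```

Each box could then be used to crop the word image out of the page scan, giving one image and label example per word.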

We have gotten interest from the Machine Learning folks in making a universal OCR engine out of this.

Non-Roman scripts would be particularly interesting, so we may need to construct the DJVUxml more from scratch.

If anyone is interested in this, please let us know by forum post, email, or phoning the archive.